IGNORABILITY FOR CATEGORICAL DATA

By Manfred Jaeger 

Aalborg Universitet 

We study the problem of ignorability in likelihood-based infer- 
ence from incomplete categorical data. Two versions of the coarsened 
at random assumption (car) are distinguished, their compatibility 
with the parameter distinctness assumption is investigated and sev- 
eral conditions for ignorability that do not require an extra parameter 
distinctness assumption are established. 

It is shown that car assumptions have quite different implications 
depending on whether the underlying complete-data model is satu- 
rated or parametric. In the latter case, car assumptions can become 
inconsistent with observed data. 

1. Introduction. In a sequence of papers Rubin [15], Heitjan and Ru- 
bin [11] and Heitjan [9, 10] have investigated the question under what con- 
ditions a mechanism that causes observed data to be incomplete or, more 
generally, coarse, can be ignored in the statistical analysis of the data. The 
key condition that has been identified is that the data should be missing at 
random (mar), respectively, coarsened at random (car). Similar conditions 
were independently proposed by Dawid and Dickey [4]. A second condition 
needed in Rubin's [15] derivation of ignorability is parameter distinctness 
(pd). 

A case of particular practical interest is the one of incomplete or coarse 
categorical data. Traditionally associated with the analysis of contingency 
tables in terms of log-linear models, categorical data today also plays an 
important role in learning probabilistic models for artificial intelligence ap- 
plications [12]. For these applications graphical models or Bayesian networks 
are used [2, 3, 13]. Incomplete data here is particularly prevalent, and the 
analysis of Rubin and Heitjan is widely cited in the field. 

In this paper we take a closer look at the way ignorability is established 
for likelihood-based inference through the car and pd assumptions. It is 
found that one has to distinguish a weak version of car that is given as a
condition on the joint distribution of complete and coarse data, and a strong 
version of car that is given as a condition on the conditional distribution of 
the coarse data. The two versions of car lead to quite different theoretical 
results and practical implications for likelihood-based inference. We consider 
in detail the dependencies between the car and the pd assumptions, and find 
that for weak car these two assumptions are incompatible unless further 
assumptions on the parameter of interest, or on the coarsening process, 
are made. In contrast, pd is implied by strong car (Section 3). For the 
case of an underlying saturated complete-data model ignorability results 
can be derived from weak car alone without making the pd assumption. 
Our main result identifies the maxima of the observed-data likelihood under 
either car assumption as exactly those complete-data distributions that are 
compatible with the car assumption and the observed data (Section 4.1). 
For nonsaturated complete-data models no analogous results hold. Even for 
very simple parametric models car becomes a testable assumption that can 
be rejected against an alternative hypothesis (Section 4.2). 

2. Coarse data models. We use a very general and abstract model for
categorical data: complete data is taken to consist of realizations $x_1,\ldots,x_N$
of independent identically distributed random variables $X_1,\ldots,X_N$ that
take values in a finite set $W=\{w_1,\ldots,w_n\}$. The $w_i$ can be the cells of
a contingency table, for instance. The distribution of the $X_i$ is assumed to
belong to a parametric family $\{P_\theta \mid \theta\in\Theta\}$, where
$\Theta\subseteq\mathbb{R}^k$ for some $k\in\mathbb{N}$. For this paper the analytic
form of a parametric family will not be important, and only the subset of
distributions contained in the family is relevant. For that reason we may
generally assume that

$$\Theta \subseteq \Delta^n := \Bigl\{(p_1,\ldots,p_n)\in[0,1]^n \Bigm| \sum_{i=1}^{n} p_i = 1\Bigr\}$$

with

$$P_\theta(w_i)=p_i, \qquad \theta=(p_1,\ldots,p_n)\in\Theta.$$

Any $\Theta\subseteq\Delta^n$ is called a complete-data model. $\Theta=\Delta^n$ is the
saturated complete-data model. In the saturated model, as well as in most of the
important parametric models for categorical data (e.g., log-linear models),
different parameters $\theta,\theta'$ may define distributions $P_\theta,P_{\theta'}$
with different sets of support. Most of the results of this paper address
difficulties that arise out of this.

When data is incomplete, then the exact value $x_i$ of $X_i$ is not observed.
According to the general coarse data model of Heitjan and Rubin [11] one
observes instead a subset $U_i$ of $W$. More specifically, Heitjan and Rubin
model coarse data by introducing additional coarsening variables $G_i$, and
taking $U_i$ to be a function $Y(x_i,g_i)$ of the complete data $x_i$ and the value
$g_i$ of the coarsening variable. In the following definition we take a slightly
different approach, and model the coarsening process directly by a joint
distribution of $X_i$ and the observed coarse data $U_i$. For categorical data
this is simpler, and avoids a sometimes artificial construction of a suitable
coarsening variable.

Fig. 1. Coarse data space.

Definition 2.1. Let $W=\{w_1,\ldots,w_n\}$. The coarse data space for $W$ is

$$\Omega(W) := \{(w,U) \mid w\in W,\ U\subseteq W : w\in U\}.$$

When specific reference to $W$ is not needed, we write $\Omega$ for $\Omega(W)$. An
element $(w,U)\in\Omega$ stands for the event that the true value of a $W$-valued
random variable $X$ is $w$ and the coarse value $U$ is observed. A subset
$U\subseteq W$ defines two different subsets in $\Omega$:
$O_U := \{(w,U)\in\Omega \mid w\in U\}$, which is the event that $U$ is observed,
and the event $\{(w,U')\in\Omega \mid w\in U\}$ that the value of $X$ lies in $U$
(and some $U'$ is observed). This latter subset of $\Omega$ is simply denoted by
$U$, and is not strictly distinguished from $U$ as an event in the sample space
$W$. Figure 1 illustrates these definitions for a three-element complete-data
space $W=\{w_1,w_2,w_3\}$. The elements of $\Omega(W)$ correspond to the unfilled
cells in this graphical representation. For $U=\{w_2,w_3\}$ the events $O_U$ and
$U$ (as a subset of $\Omega$) are outlined.
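The two events attached to a subset $U$ are easy to confuse, so a small enumeration may help. The following is a minimal sketch, not part of the original development; the function names `coarse_data_space`, `observed_event` and `value_event` are hypothetical and chosen only for this illustration.

```python
from itertools import combinations

def coarse_data_space(W):
    """All pairs (w, U) with U a nonempty subset of W and w an element of U."""
    subsets = [frozenset(c) for r in range(1, len(W) + 1)
               for c in combinations(W, r)]
    return [(w, U) for U in subsets for w in sorted(U)]

def observed_event(omega, U):
    """O_U: the coarse value U is observed (the true value is some w in U)."""
    return [(w, V) for (w, V) in omega if V == U]

def value_event(omega, U):
    """U as a subset of Omega: the true value lies in U, any U' may be observed."""
    return [(w, V) for (w, V) in omega if w in U]

W = ["w1", "w2", "w3"]
omega = coarse_data_space(W)
U = frozenset({"w2", "w3"})
print(len(omega))                     # 12 cells in Omega(W)
print(len(observed_event(omega, U)))  # 2 cells in O_U
print(len(value_event(omega, U)))     # 8 cells in the event "X lies in U"
```

For the three-element space, $\Omega(W)$ has 12 cells, $O_U$ contains the 2 cells with second component $U$, and the event that $X$ lies in $U$ contains 8 cells.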

A distribution $P$ on $\Omega$ is parameterized by the parameters $\theta$ defining the
marginal distribution on $W$, and parameters

$$\lambda_{w,U} := P((w,U) \mid w), \qquad (w,U)\in\Omega,$$

defining the coarsening process.

Example 2.2. Table 1 specifies distributions $P^{(i)}$, $i=1,2,3$, on
$\Omega(\{w_1,w_2,w_3\})$ through parameters $\theta^{(i)}$ on $W$ and conditional
probabilities $\lambda^{(i)}_{w,U}$. For $w$ with $P_{\theta^{(i)}}(w)=0$ the
parameters $\lambda_{w,U}$ are shown in brackets. Changing these parameters to
arbitrary other values just leads to a different version of the conditional
distribution of coarse observations given complete data, and has no
influence on the joint distribution.

Table 1
Parameters for distributions $P^{(1)}, P^{(2)}, P^{(3)}$

                               lambda^(i)_{w,U} for U =
  i    w    theta^(i)   {w1}    {w2}    {w3}    {w1,w2}  {w1,w3}  {w2,w3}  {w1,w2,w3}
 i=1   w1   0           [1/3]   -       -       [1/3]    [0]      -        [1/3]
       w2   1           -       0       -       1/3      -        1/3      1/3
       w3   0           -       -       [1/3]   -        [0]      [1/3]    [1/3]
 i=2   w1   1/2         0       -       -       2/3      0        -        1/3
       w2   0           -       [1/3]   -       [2/3]    -        [0]      [0]
       w3   1/2         -       -       0       -        0        2/3      1/3
 i=3   w1   1/3         1/3     -       -       1/3      0        -        1/3
       w2   1/3         -       0       -       1/3      -        1/3      1/3
       w3   1/3         -       -       1/3     -        0        1/3      1/3

(A dash marks pairs with $w\notin U$, for which no parameter exists.)
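The remark that bracketed parameters do not influence the joint distribution can be checked directly. The sketch below is illustrative only: it encodes $\theta^{(2)}$ and $\lambda^{(2)}$ from Table 1, replaces the bracketed row for $w_2$ by an arbitrarily chosen alternative (an assumption made for this example), and compares the resulting joint distributions on $\Omega$. The helper name `joint` is hypothetical.

```python
from fractions import Fraction as F

def joint(theta, lam):
    """Joint distribution on Omega: P((w, U)) = theta[w] * lam[(w, U)]."""
    return {cell: theta[cell[0]] * p for cell, p in lam.items()}

fs = frozenset
W = fs({"w1", "w2", "w3"})

# theta^(2) and lambda^(2) from Table 1; the row for w2 is the bracketed one.
theta = {"w1": F(1, 2), "w2": F(0), "w3": F(1, 2)}
lam = {
    ("w1", fs({"w1"})): F(0), ("w1", fs({"w1", "w2"})): F(2, 3),
    ("w1", fs({"w1", "w3"})): F(0), ("w1", W): F(1, 3),
    ("w2", fs({"w2"})): F(1, 3), ("w2", fs({"w1", "w2"})): F(2, 3),
    ("w2", fs({"w2", "w3"})): F(0), ("w2", W): F(0),
    ("w3", fs({"w3"})): F(0), ("w3", fs({"w1", "w3"})): F(0),
    ("w3", fs({"w2", "w3"})): F(2, 3), ("w3", W): F(1, 3),
}

# Hypothetical alternative: put all conditional mass for w2 on U = {w2}.
lam_alt = dict(lam)
lam_alt.update({
    ("w2", fs({"w2"})): F(1), ("w2", fs({"w1", "w2"})): F(0),
    ("w2", fs({"w2", "w3"})): F(0), ("w2", W): F(0),
})

same_joint = joint(theta, lam) == joint(theta, lam_alt)
total = sum(joint(theta, lam).values())
print(same_joint, total)  # True 1
```

Because $P_{\theta^{(2)}}(w_2)=0$, every cell $(w_2,U)$ receives joint probability zero under either choice, so the two joint distributions coincide.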



As in this example, we generally assume that parameters $\lambda_{w,U}$ exist even
when $P_\theta(w)=0$ (rather than treating them as undefined), because in that
way the parameter space $\Lambda^n$ for the $\lambda$-parameters does not depend on
$\theta$:

$$\Lambda^n := \Bigl\{(\lambda_{w,U})_{w\in W,\,U\subseteq W:\,w\in U} \Bigm| \lambda_{w,U}\in[0,1];\ \forall w: \sum_{U:\,w\in U}\lambda_{w,U}=1\Bigr\}.$$

Any subset $\Sigma\subseteq\Delta^n\times\Lambda^n$ is called a coarse data model. Such a
model encodes assumptions both on the underlying complete-data distribution and
on the coarsening process. The complete-data model underlying $\Sigma$ is

$$\Theta = \{\theta\in\Delta^n \mid \exists\lambda : (\theta,\lambda)\in\Sigma\}.$$

We sometimes write $\Sigma(\Theta)$ for $\Sigma$ to emphasize the underlying complete-data
model. We denote with $\Sigma_{\text{sat}}(\Theta)=\Theta\times\Lambda^n$ the saturated
coarsening model with underlying $\Theta$.

A sample of coarse data items $\mathcal{U}=U_1,\ldots,U_N$ ($U_i\subseteq W$) is
interpreted with respect to a coarse data model as observations of events
$O_{U_i}$ in $\Omega$, and gives rise to the observed-data likelihood for $\theta$ and
$\lambda$,

$$L_{OD}(\theta,\lambda\mid\mathcal{U}) := \prod_{i=1}^{N} P_{\theta,\lambda}(O_{U_i}). \tag{1}$$

When ignoring the coarsening process, the data items $U_i$ are simply
interpreted as subsets of $W$ and give rise to the face-value likelihood [4] for
$\theta$,

$$L_{FV}(\theta\mid\mathcal{U}) := \prod_{i=1}^{N} P_\theta(U_i). \tag{2}$$
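To make the difference between (1) and (2) concrete, the following sketch evaluates both likelihoods for $\theta^{(2)}$ and $\lambda^{(2)}$ of Table 1 on the hypothetical sample $\{w_1,w_2\}$, $\{w_2,w_3\}$, $\{w_1,w_2,w_3\}$. The function names are illustrative, and only the $\lambda$-entries for the observed sets are listed.

```python
from fractions import Fraction as F
from math import prod

fs = frozenset
W = fs({"w1", "w2", "w3"})

def p_observed(theta, lam, U):
    """P_{theta,lambda}(O_U) = sum over w in U of theta[w] * lam[(w, U)]."""
    return sum(theta[w] * lam[(w, U)] for w in U)

def likelihood_od(theta, lam, sample):
    """Observed-data likelihood (1)."""
    return prod(p_observed(theta, lam, U) for U in sample)

def likelihood_fv(theta, sample):
    """Face-value likelihood (2)."""
    return prod(sum(theta[w] for w in U) for U in sample)

U1, U2, U3 = fs({"w1", "w2"}), fs({"w2", "w3"}), W
sample = [U1, U2, U3]

# theta^(2) and the entries of lambda^(2) for the observed sets (Table 1).
theta = {"w1": F(1, 2), "w2": F(0), "w3": F(1, 2)}
lam = {
    ("w1", U1): F(2, 3), ("w2", U1): F(2, 3),
    ("w2", U2): F(0), ("w3", U2): F(2, 3),
    ("w1", U3): F(1, 3), ("w2", U3): F(0), ("w3", U3): F(1, 3),
}

l_od = likelihood_od(theta, lam, sample)
l_fv = likelihood_fv(theta, sample)
print(l_od, l_fv)  # 1/27 1/4
```

Each observed event has probability 1/3 under $P^{(2)}$, so the observed-data likelihood is 1/27, whereas the face-value likelihood is $1/2\cdot 1/2\cdot 1 = 1/4$.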

3. Ignorability. The question of ignorability is under what conditions
inferences about $\theta$ based on the face-value likelihood will be the same as
obtained from the observed-data likelihood. These conditions will depend on
the inference methods used [15]. Here we focus on the problem of ignorability
for likelihood-based inference, with special emphasis on maximum likelihood
estimation, which plays an important role in practice through the widespread
use of the EM algorithm [6, 14].

For likelihood-based inference about $\theta$, the observed-data likelihood will
typically be reduced to the profile-likelihood

$$L_{P,\Sigma}(\theta\mid\mathcal{U}) := \max_{\lambda:\,(\theta,\lambda)\in\Sigma} L_{OD}(\theta,\lambda\mid\mathcal{U}). \tag{3}$$

To make the profile-likelihood well defined for all $\theta$, we restrict ourselves
to models $\Sigma$ for which $\{\lambda \mid (\theta,\lambda)\in\Sigma\}$ is closed for every
$\theta\in\Theta$, so that the maximum in (3) is attained. In our notation we make
explicit that the profile-likelihood is not only a function of $\theta$ and
$\mathcal{U}$, but also of the coarse data model $\Sigma$.

Moving from the observed-data likelihood to the profile-likelihood enables
us to treat inference both with and without taking the coarsening process
into account as inference with a likelihood function of only the parameter
of interest, $\theta$. In particular, we obtain succinct formulations of ignorability
questions: under what conditions on $\Sigma$ are likelihood ratios
$L_{P,\Sigma}(\theta)/L_{P,\Sigma}(\theta')$ and $L_{FV}(\theta)/L_{FV}(\theta')$ equal for all
$\theta,\theta'$; under what conditions are $L_{P,\Sigma}$ and $L_{FV}$ maximized by the
same values $\theta\in\Theta$?

In the following we formulate the car and parameter distinctness assumptions
as such modeling assumptions on $\Sigma$. In the case of car it turns out
that we must distinguish two different versions.

Definition 3.1. The data is weakly coarsened at random (w-car) according
to $P_{\theta,\lambda}$, if for all $U\subseteq W$ and all $w,w'\in U$

$$P_\theta(w)>0,\ P_\theta(w')>0 \ \Rightarrow\ \lambda_{w,U}=\lambda_{w',U}. \tag{4}$$

Definition 3.2. The data is strongly coarsened at random (s-car) according
to $P_{\theta,\lambda}$, if for all $U\subseteq W$ and all $w,w'\in U$

$$\lambda_{w,U}=\lambda_{w',U}. \tag{5}$$

The difference between weak and strong car, thus, is that s-car also imposes
a restriction on conditional probabilities $P_{\theta,\lambda}(O_U\mid w)$ when
$P_\theta(w)=0$. This is the version of car used by Gill, van der Laan and Robins [7]
for categorical data. Underlying this version of car is the notion of car being
a condition on the coarsening mechanism alone, which must be formulated
without reference to the underlying complete-data distribution. Weak car,
on the other hand, appears to be the more appropriate version when car is
seen as a condition on the joint distribution of complete and coarsened data.

Gill, van der Laan and Robins ([7], page 274) also give a definition for
car in general sample spaces. In contrast to their definitions in the discrete
setup, that definition reduces for finite sample spaces to w-car, not s-car.
They pose as an open problem whether (in the terminology established by
our preceding definitions) it is always possible to turn a w-car model into an
s-car model by a suitable setting of the $\lambda_{w,U}$-parameters for those $w$ with
$P_\theta(w)=0$. Our next example shows that this is not the case.

Example 3.3. All distributions in Table 1 are w-car, but only $P^{(1)}$ and
$P^{(3)}$ are s-car: to check the w-car condition it is only necessary to verify
that all unbracketed $\lambda_{w,U}$ in a column are pairwise equal. For s-car,
equality of the bracketed parameters is also required. This condition is violated
in the last two columns for $P^{(2)}$. Moreover, it is not possible to replace the
bracketed $\lambda^{(2)}$-values with different conditional probabilities in a way
that s-car is satisfied, because the conditional probabilities for the observations
$\{w_1,w_2\}$, $\{w_2,w_3\}$ and $\{w_1,w_2,w_3\}$ given $w_2$ would have to add up to 5/3.
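The column checks described in this example can be mechanized. The following is a minimal sketch under the assumption that Table 1 is encoded row by row in its column order; `build_lambda`, `is_w_car` and `is_s_car` are hypothetical helper names, and the conditions implemented are (4) and (5).

```python
from fractions import Fraction as F

fs = frozenset
W = ("w1", "w2", "w3")
# Column order of Table 1.
COLUMNS = [fs({"w1"}), fs({"w2"}), fs({"w3"}), fs({"w1", "w2"}),
           fs({"w1", "w3"}), fs({"w2", "w3"}), fs(W)]

def build_lambda(rows):
    """rows[w] lists lambda_{w,U} for the columns U containing w, in table order."""
    lam = {}
    for w, values in rows.items():
        cols = [U for U in COLUMNS if w in U]
        lam.update({(w, U): v for U, v in zip(cols, values)})
    return lam

def is_w_car(theta, lam):
    """Condition (4): equality only among w of positive probability."""
    return all(len({lam[(w, U)] for w in U if theta[w] > 0}) <= 1
               for U in COLUMNS)

def is_s_car(lam):
    """Condition (5): equality among all w in U."""
    return all(len({lam[(w, U)] for w in U}) <= 1 for U in COLUMNS)

t = F(1, 3)
ROWS_A = {"w1": [t, t, 0, t], "w2": [0, t, t, t], "w3": [t, 0, t, t]}  # i = 1, 3
ROWS_B = {"w1": [0, 2 * t, 0, t], "w2": [t, 2 * t, 0, 0], "w3": [0, 0, 2 * t, t]}
TABLE1 = {
    1: ({"w1": F(0), "w2": F(1), "w3": F(0)}, build_lambda(ROWS_A)),
    2: ({"w1": F(1, 2), "w2": F(0), "w3": F(1, 2)}, build_lambda(ROWS_B)),
    3: ({"w1": t, "w2": t, "w3": t}, build_lambda(ROWS_A)),
}

result = {i: (is_w_car(th, lam), is_s_car(lam)) for i, (th, lam) in TABLE1.items()}
print(result)  # {1: (True, True), 2: (True, False), 3: (True, True)}
```

The output agrees with the example: all three distributions are w-car, and only $P^{(1)}$ and $P^{(3)}$ are s-car.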

In the following we write car when we wish to refer uniformly to both 
versions of car, for example, in definitions that can be analogously given for 
both versions, or in statements that hold for both versions. 

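When $P_{\theta,\lambda}$ satisfies car we denote parameters $\lambda_{w,U}$ simply with
$\lambda_U$. In the case of w-car this denotes the parameter $\lambda_{w,U}$ common for
all $w$ of positive probability. When $P_\theta(U)=0$, then $\lambda_U$ is not well
defined for w-car. We denote with $\Sigma_{\text{car}}(\Theta)$ the subset of
$\Sigma_{\text{sat}}(\Theta)$ consisting of those parameters according to which the data
is car. For $\theta\in\Theta$ we denote with $\Lambda_{\text{w-car}}(\theta)$ the set of
$\lambda\in\Lambda^n$ that satisfy (4). Thus,
$\Sigma_{\text{w-car}}(\Theta)=\{(\theta,\lambda)\mid\theta\in\Theta,\ \lambda\in\Lambda_{\text{w-car}}(\theta)\}$.
From Definition 3.1 it follows that
$\mathrm{support}(P_\theta)\subseteq\mathrm{support}(P_{\theta'})$ implies
$\Lambda_{\text{w-car}}(\theta)\supseteq\Lambda_{\text{w-car}}(\theta')$. For s-car we can
simply define the set $\Lambda_{\text{s-car}}$ of coarsening parameters that satisfy (5),
and have $\Sigma_{\text{s-car}}(\Theta)=\Theta\times\Lambda_{\text{s-car}}$.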

The following definition provides an important alternative characteriza- 
tion of w-car. 

Definition 3.4. $P_{\theta,\lambda}$ satisfies the fair evidence condition if for all
$w,U$ with $w\in U$,

$$P_{\theta,\lambda}(O_U)>0 \ \Rightarrow\ P_{\theta,\lambda}(w\mid O_U)=P_\theta(w\mid U). \tag{6}$$

The fair evidence condition is necessary to justify updating a probability
distribution by conditioning when an observation is made that establishes
the actual state to be a member of $U$ [8]. We now obtain:

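Theorem 3.5. The following are equivalent for $P_{\theta,\lambda}$:

(a) $P_{\theta,\lambda}$ satisfies w-car.

(b) $P_{\theta,\lambda}$ satisfies the fair evidence condition.

(c) For all $w,U$ with $w\in U$ and $P_\theta(w)>0$,

$$P_{\theta,\lambda}(O_U\mid w)=P_{\theta,\lambda}(O_U)/P_\theta(U).$$

Proof. (a)$\Rightarrow$(b): If $P_{\theta,\lambda}(O_U\mid w)=P_{\theta,\lambda}(O_U\mid w')$ for all
$w,w'\in U$ with $P_\theta(w),P_\theta(w')>0$, then this value is equal to
$P_{\theta,\lambda}(O_U\mid U)$. Assume that $P_\theta(w)>0$ [otherwise there is nothing to
show for (6)]. Using $P_{\theta,\lambda}(U\mid O_U)=1$, then

$$P_{\theta,\lambda}(w\mid O_U) = \frac{P_{\theta,\lambda}(O_U\mid w)P_\theta(w)}{P_{\theta,\lambda}(O_U)} = \frac{P_{\theta,\lambda}(O_U\mid U)P_\theta(w)}{P_{\theta,\lambda}(O_U)} = \frac{P_{\theta,\lambda}(U\mid O_U)P_\theta(w)}{P_\theta(U)} = P_\theta(w\mid U).$$

(b)$\Rightarrow$(c): Let $w\in U$ with $P_\theta(w)>0$. Then

$$P_{\theta,\lambda}(O_U\mid w) = \frac{P_{\theta,\lambda}(w\mid O_U)P_{\theta,\lambda}(O_U)}{P_\theta(w)} = \frac{P_\theta(w\mid U)P_{\theta,\lambda}(O_U)}{P_\theta(w)} = \frac{P_{\theta,\lambda}(O_U)}{P_\theta(U)}.$$

(c)$\Rightarrow$(a): Obvious. □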
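The equivalence of (a) and (b) can be illustrated numerically. The sketch below checks condition (6) for $P^{(2)}$ of Table 1, which is w-car, and for a hypothetical distribution that is not w-car (uniform $\theta$ with $w_1$ always observed exactly; this second distribution is an assumption made only for this illustration). The helper names are likewise hypothetical.

```python
from fractions import Fraction as F

fs = frozenset
W = ("w1", "w2", "w3")
SUBSETS = [fs({"w1"}), fs({"w2"}), fs({"w3"}), fs({"w1", "w2"}),
           fs({"w1", "w3"}), fs({"w2", "w3"}), fs(W)]

def build_lambda(rows):
    """rows[w] lists lambda_{w,U} for the sets U containing w, in SUBSETS order."""
    lam = {}
    for w, values in rows.items():
        cols = [U for U in SUBSETS if w in U]
        lam.update({(w, U): v for U, v in zip(cols, values)})
    return lam

def fair_evidence(theta, lam):
    """Condition (6): P(w | O_U) = P_theta(w | U) whenever P(O_U) > 0."""
    for U in SUBSETS:
        p_obs = sum(theta[w] * lam[(w, U)] for w in U)
        if p_obs == 0:
            continue  # nothing to check for this U
        p_U = sum(theta[w] for w in U)
        if any(theta[w] * lam[(w, U)] / p_obs != theta[w] / p_U for w in U):
            return False
    return True

t = F(1, 3)
# P^(2) from Table 1 (w-car).
theta2 = {"w1": F(1, 2), "w2": F(0), "w3": F(1, 2)}
lam2 = build_lambda({"w1": [0, 2 * t, 0, t], "w2": [t, 2 * t, 0, 0],
                     "w3": [0, 0, 2 * t, t]})
# Hypothetical non-car variant: w1 is always observed as {w1}.
theta_u = {"w1": t, "w2": t, "w3": t}
lam_bad = build_lambda({"w1": [1, 0, 0, 0], "w2": [0, t, t, t],
                        "w3": [t, 0, t, t]})

print(fair_evidence(theta2, lam2), fair_evidence(theta_u, lam_bad))  # True False
```

In the second distribution, observing $\{w_1,w_2\}$ rules out $w_1$, so conditioning on $U$ alone would give the wrong posterior; this is exactly the failure of (6).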

Example 3.6. To check the fair evidence condition for the distributions
of Table 1, one has to verify that for each observation $O_U$, normalizing all
nonbracketed entries in the $\lambda$-column for $O_U$ yields the conditional
distribution of $P_\theta$ on $U$.

One might suspect that one can also obtain a "strong fair evidence condition"
by considering the normalization of both the bracketed and the unbracketed
$\lambda$-entries, and that this strong version of the fair evidence condition
would be equivalent to s-car. However, already for $P^{(1)}$ (which is s-car),
we see that for $U=\{w_1,w_2\}$ the normalization of the column for $O_U$ gives
$(1/2,1/2)$ on $U$, which is not $P_\theta(\cdot\mid U)$.

Gill, van der Laan and Robins ([7], page 260) claim the equivalence of the
fair evidence condition and s-car. However, as our results show, fair evidence
is equivalent to w-car, not s-car. (The error in the proof of Gill, van der Laan
and Robins [7] lies in an (implicit) application of Bayes' rule to conditioning
events of zero probability.) A correct proof of the equivalence (a)$\Leftrightarrow$(b)
is also given by Grünwald and Halpern [8]. We consider the equivalence with
the fair evidence condition to be an important point in favor of w-car as
opposed to s-car.

Weak and strong car are modeling assumptions that identify certain
coarse data distributions for inclusion in our model. The second condition
usually required for ignorability, parameter distinctness, on the other hand,
is a global condition on the structure of the coarse data model.

Definition 3.7. A coarse data model $\Sigma$ satisfies parameter distinctness
(pd) iff $\Sigma=\Theta\times\Lambda$ for some $\Theta\subseteq\Delta^n$, $\Lambda\subseteq\Lambda^n$.

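From car and pd, ignorability for likelihood-based inference can be derived.
We next restate Rubin's proof of this result, in a way that clearly separates
the contributions made by car and pd. To begin, assume that
$\Sigma\subseteq\Sigma_{\text{car}}$, and let $(\theta,\lambda)\in\Sigma$. Let $\mathcal{U}$ be a
sample with $P_\theta(U_i)>0$ for $i=1,\ldots,N$. Now

$$L_{OD}(\theta,\lambda\mid\mathcal{U}) = \prod_{i=1}^{N} P_{\theta,\lambda}(O_{U_i}) = \prod_{i=1}^{N}\sum_{w\in U_i} P_{\theta,\lambda}((w,U_i)) = \prod_{i=1}^{N}\sum_{w\in U_i} P_\theta(w)\lambda_{U_i} = L_{FV}(\theta\mid\mathcal{U})\prod_{i=1}^{N}\lambda_{U_i}.$$

Thus

$$L_{P,\Sigma}(\theta\mid\mathcal{U}) = c_\Sigma(\theta,\mathcal{U})\,L_{FV}(\theta\mid\mathcal{U}), \tag{7}$$

where

$$c_\Sigma(\theta,\mathcal{U}) := \max_{\lambda:\,(\theta,\lambda)\in\Sigma}\ \prod_{i=1}^{N}\lambda_{U_i}. \tag{8}$$

Now assume, too, that pd holds, that is, $\Sigma=\Theta\times\Lambda$. The right-hand side
of (8) then simply becomes $\max_{\lambda\in\Lambda}\prod_{i=1}^{N}\lambda_{U_i}$, which no
longer depends on $\theta$. $L_{P,\Sigma}$ and $L_{FV}$, thus, differ only by a constant,
so that inferences based on likelihood ratios of $L_{FV}$ are justified.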

This derivation also provides the answer to a somewhat subtle question
that arises out of our analysis so far: we have assumed throughout that
the coarse data will be analyzed correctly in the coarse data space $\Omega$ using
the observed-data likelihood $L_{OD}$. However, interpreting the data in $\Omega$
means that we still are dealing with coarse data, because it now is seen to
consist of observations of subsets $O_U$ of $\Omega$, not of complete observations
$(w,U)\in\Omega$. The question then is whether we have gained anything: $L_{OD}$
really is nothing but the face-value likelihood of incomplete data in the more
sophisticated complete-data space $\Omega$. Do we thus have to build a second-order
coarse data model on top of $\Omega$, and so on? The answer is no, because
the coarsening process that turns complete data $(w,U)$ from $\Omega$ into coarse
observations $O_U$ always is ignorable: in the second-order coarsening model
we have $\lambda_{(w,U'),O_U}=1$ iff $U'=U$, which means that here the data is car,
and the factor $c(\theta,\mathcal{U})$ in (7) is always equal to 1.

How can this ignorability result be used in practice? In most cases it is
appealed to simply by stating that the car and pd assumptions are made,
and that this justifies the use of the face-value likelihood. This, however, is
a rather incomplete justification, because car and pd together are not
well-defined modeling assumptions that determine a unique coarse data model
$\Sigma$. To make the car and pd assumptions only means to assume that the coarse
data model $\Sigma$ is some subset of $\Sigma_{\text{car}}(\Theta)$, and has product form
$\Theta'\times\Lambda'$. In the case of w-car, nontrivial further modeling assumptions
may have to be made to ensure that pd holds, because $\Sigma_{\text{w-car}}(\Theta)$
itself usually is not a product. The following example illustrates the
consequences for likelihood-based inferences under w-car. From now on we write
$L_{P,\text{car}}$ and $c_{\text{car}}$ for $L_{P,\Sigma_{\text{car}}(\Theta)}$, respectively
$c_{\Sigma_{\text{car}}(\Theta)}$, and similarly for $\Sigma_{\text{sat}}(\Theta)$. The underlying
$\Theta$ will always be clear from the context.

Example 3.8. Let $\theta^{(i)}$, $i=1,2,3$, be as in Example 2.2. Let $\mathcal{U}$ be a
sample consisting of $U_1=\{w_1,w_2\}$, $U_2=\{w_2,w_3\}$, $U_3=\{w_1,w_2,w_3\}$. It is
readily verified that for $i=1,2,3$,

$$c_{\text{w-car}}(\theta^{(i)},\mathcal{U}) = \lambda^{(i)}_{U_1}\cdot\lambda^{(i)}_{U_2}\cdot\lambda^{(i)}_{U_3},$$

that is, the coarsening parameters given in Table 1 maximize
$\lambda_{U_1}\cdot\lambda_{U_2}\cdot\lambda_{U_3}$ over all parameters in
$\Lambda_{\text{w-car}}(\theta^{(i)})$. It also follows immediately that
$c_{\text{s-car}}(\theta^{(i)},\mathcal{U})=(1/3)^3=1/27$. With these $c_{\text{car}}$-values
one obtains the likelihood values shown in Table 2.

The first two columns of Table 2 show that likelihood ratios of $L_{FV}$ and
$L_{P,\text{w-car}}$ do not coincide. Also the weaker ignorability condition of
identical likelihood maxima does not apply: $L_{P,\text{w-car}}$ has the two maxima
$\theta^{(1)}$ and $\theta^{(2)}$ (Theorem 4.4 below will show that these are indeed
global maxima of $L_{P,\text{w-car}}$), but of these only $\theta^{(1)}$ also maximizes
$L_{FV}$. It is not surprising that ignorability here cannot be established on
the basis of w-car alone, because $\Sigma_{\text{w-car}}$ does not satisfy pd, and hence
the factors $c_{\text{w-car}}(\theta,\mathcal{U})$ in (7) are different for different
$\theta$. However, in Section 4.1 we will see that even on the basis of w-car
alone a useful ignorability result can be obtained.

The s-car assumption, on the other hand, yields ignorability in the strong
sense of equal likelihood ratios, because
$\Sigma_{\text{s-car}}=\Theta\times\Lambda_{\text{s-car}}$ satisfies pd.
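The likelihood values of this example follow from identity (7) by simple arithmetic. The sketch below is illustrative only: it does not perform the maximization in (8), but takes the maximizing values $\lambda^{(i)}_{U_j}$ from Table 1 and the value $c_{\text{s-car}}=1/27$ as stated in the example, and multiplies them with the face-value likelihood.

```python
from fractions import Fraction as F
from math import prod

fs = frozenset
SAMPLE = [fs({"w1", "w2"}), fs({"w2", "w3"}), fs({"w1", "w2", "w3"})]

THETA = {
    1: {"w1": F(0), "w2": F(1), "w3": F(0)},
    2: {"w1": F(1, 2), "w2": F(0), "w3": F(1, 2)},
    3: {"w1": F(1, 3), "w2": F(1, 3), "w3": F(1, 3)},
}
# lambda_{U_1}, lambda_{U_2}, lambda_{U_3}: unbracketed values read off Table 1.
LAMBDA_W_CAR = {
    1: [F(1, 3), F(1, 3), F(1, 3)],
    2: [F(2, 3), F(2, 3), F(1, 3)],
    3: [F(1, 3), F(1, 3), F(1, 3)],
}
C_S_CAR = F(1, 27)  # value stated in Example 3.8, not recomputed here

def likelihood_fv(theta, sample):
    """Face-value likelihood (2)."""
    return prod(sum(theta[w] for w in U) for U in sample)

table2 = {}
for i in (1, 2, 3):
    l_fv = likelihood_fv(THETA[i], SAMPLE)
    c_w = prod(LAMBDA_W_CAR[i])          # c_{w-car}(theta^(i), U)
    table2[i] = (l_fv, l_fv * c_w, l_fv * C_S_CAR)

for i, row in table2.items():
    print(i, *row)
```

The profile likelihoods under w-car come out as 1/27, 1/27 and 4/243, so $\theta^{(1)}$ and $\theta^{(2)}$ tie, while the face-value likelihoods 1, 1/4 and 4/9 single out $\theta^{(1)}$.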

Table 2
Likelihood values

  i    L_FV(theta^(i)|U)    L_P,w-car(theta^(i)|U)    L_P,s-car(theta^(i)|U)
  1    1                    1 · 1/27                  1 · 1/27
  2    1/4                  1/4 · 4/27                1/4 · 1/27
  3    4/9                  4/9 · 1/27                4/9 · 1/27

We thus obtain the following picture on the interdependence between the
car and pd assumptions: s-car as the only modeling assumption on the
coarsening process implies pd. To obtain ignorability, it therefore is sufficient
to stipulate s-car. When one stipulates w-car as a modeling assumption, then
additional assumptions are required to make the resulting model also satisfy
pd. It must be realized that pd is itself not a well-defined modeling
assumption, because it does not identify any particular subset of distributions
for inclusion in the model. A joint assumption of w-car and pd is possible only
if suitable further restrictions on either the complete-data model
$\Theta$ or on the coarsening process are made. One possible restriction on $\Theta$
is to assume a fixed set of support for the distributions $P_\theta$. If, for example,
$\Theta\subseteq\{\theta\mid\mathrm{support}(P_\theta)=W\}$, then
$\Sigma_{\text{w-car}}(\Theta)$ has pd. However, in most cases it
is not possible to determine a priori the set of support of a categorical data
distribution under investigation, and hence models allowing for different sets
of support have to be used.

A further assumption one can make on the coarsening mechanism is that
the data is completely coarsened at random (ccar) [9]. We do not give the
precise definitions here, but only note that
$\Sigma_{\text{ccar}}(\Theta)\subseteq\Sigma_{\text{s-car}}(\Theta)$ for any $\Theta$,
and that $\Sigma_{\text{ccar}}(\Theta)$ has pd. Thus ccar, too, guarantees ignorability
when it is the only modeling assumption on the coarsening mechanism. However,
ccar is considered to be an unrealistically strong assumption for most
applications.

4. Ignorability without parameter distinctness. In the preceding section
we have seen that standard ignorability conditions cannot be established
from the w-car assumption alone, because $\Sigma_{\text{w-car}}$ does not have pd. In
this section we pursue the question whether some ignorability results can
nevertheless be obtained from w-car. It turns out that in the case of the
saturated complete-data model $\Theta=\Delta^n$ a fairly strong ignorability result
for maximum likelihood inference can be obtained (Section 4.1). For nonsaturated
complete-data models s-car is needed for ignorability. However, with
nonsaturated models car becomes a testable assumption that, based on the
observed data, may have to be rejected against the not-car alternative
(Section 4.2).

The following simple lemma pertains to both saturated and nonsaturated
complete-data models. For the formulation of the lemma we introduce the
notation $c_{\text{w-car}}(V,\mathcal{U})$ for $c_{\text{w-car}}(\theta,\mathcal{U})$, where
$\theta\in\Theta$ is such that $\mathrm{support}(P_\theta)=V\subseteq W$. As
$c_{\text{w-car}}(\theta,\mathcal{U})$ depends on $\theta$ only through
$\mathrm{support}(P_\theta)$, this is unambiguous. Identity (7) then becomes

$$L_{P,\text{w-car}}(\theta\mid\mathcal{U}) = c_{\text{w-car}}(\mathrm{support}(P_\theta),\mathcal{U})\,L_{FV}(\theta\mid\mathcal{U}). \tag{9}$$

The following lemma is immediate from the definitions.

Lemma 4.1. Let $V\subseteq V'\subseteq W$. Then
$c_{\text{w-car}}(V,\mathcal{U})\geq c_{\text{w-car}}(V',\mathcal{U})$.
4.1. The saturated model. For this section let $\Theta=\Delta^n$ be the saturated
complete-data model. We immediately obtain a weak ignorability result.

Theorem 4.2. Let $\hat\theta\in\Delta^n$ be a local maximum of $L_{FV}(\cdot\mid\mathcal{U})$.
Then $\hat\theta$ is also a local maximum of $L_{P,\text{w-car}}(\cdot\mid\mathcal{U})$.

Proof. Let $\hat\theta$ be a local maximum of $L_{FV}(\theta\mid\mathcal{U})$. There exists a
neighborhood of $\hat\theta$ such that every $\theta$ in this neighborhood satisfies
$\mathrm{support}(P_\theta)\supseteq\mathrm{support}(P_{\hat\theta})$.
With (9) and Lemma 4.1 the theorem then follows. □

We next show that local maxima of $L_{FV}$ are, in fact, global maxima
of $L_{P,\text{w-car}}$, thus establishing ignorability in a strong sense for maximum
likelihood inference in the saturated model. For the characterization of the
maxima of $L_{P,\text{w-car}}$ the following definitions are needed. The terminology
here is adopted from [5].

Definition 4.3. Let $\Omega(W)$ be as in Definition 2.1. We denote with $\mathcal{O}$
the partition $\{O_U\mid\emptyset\neq U\subseteq W\}$ of $\Omega$. Let $m$ be a probability
distribution on $\mathcal{O}$ and let $P_\theta$ be a probability distribution on $W$. We
say that $m$ and $P_\theta$ are compatible, written $m\sim P_\theta$, if there exist
parameters $\lambda\in\Lambda^n$ such that $P_{\theta,\lambda}(O_U)=m(O_U)$ for all
$O_U\in\mathcal{O}$. We say that $m$ and $P_\theta$ are car-compatible
(written $m\sim_{\text{car}}P_\theta$) if there exists such a
$\lambda\in\Lambda_{\text{car}}(\theta)$.

Theorem 4.4. Let $\mathcal{U}$ be a set of data, and let $m$ be the empirical
distribution induced by $\mathcal{U}$ on $\mathcal{O}$. For $\hat\theta\in\Delta^n$ with
$\mathrm{support}(P_{\hat\theta})=V\subseteq W$ the following are equivalent:

(a) $m\sim_{\text{w-car}}P_{\hat\theta}$.

(b) $\hat\theta$ is a global maximum of $L_{P,\text{w-car}}(\theta\mid\mathcal{U})$ in $\Delta^n$.

(c) $L_{FV}(\hat\theta\mid\mathcal{U})>0$, and $\hat\theta$ is a local maximum of
$L_{FV}(\theta\mid\mathcal{U})$ within $\{\theta\in\Delta^n\mid\mathrm{support}(P_\theta)=V\}$.

Theorem 4.4 establishes ignorability for maximum likelihood inference in
a slightly different version from our original formulation in Section 3: it is
not the case that $L_{P,\text{w-car}}$ and $L_{FV}$ are (globally) maximized by the same
$\theta\in\Theta$; however, maximization of $L_{FV}$ will nevertheless produce the
desired maxima of $L_{P,\text{w-car}}$, and, moreover, only a local maximum of
$L_{FV}$ must be found.

The proof of the theorem is preceded by two lemmas. The first one
characterizes maxima of the observed-data likelihood in the saturated coarse data
model.

Lemma 4.5. Let $\mathcal{U}$ and $m$ be as in Theorem 4.4. For $\hat\theta\in\Delta^n$ then
the following are equivalent:

(i) $m\sim P_{\hat\theta}$.

(ii) $\hat\theta$ is a global maximum of $L_{P,\text{sat}}(\theta\mid\mathcal{U})$.

Proof. The likelihood $L_{OD}(\theta,\lambda\mid\mathcal{U})$ only depends on the marginal
of $P_{\theta,\lambda}$ on $\mathcal{O}$, and is thus maximized whenever this marginal agrees
with the empirical distribution.

The equivalence (i)$\Leftrightarrow$(ii) follows, because for every empirical
distribution $m$ there exists at least one parameter
$(\theta,\lambda)\in\Sigma_{\text{sat}}(\Delta^n)$ such that the
marginal of $P_{\theta,\lambda}$ on $\mathcal{O}$ is $m$. □

Lemma 4.6. The following are equivalent:

(i) $m\sim_{\text{w-car}}P_\theta$.

(ii) For all $w\in W$: $P_\theta(w)>0 \ \Rightarrow\ \sum_{U:\,w\in U}\dfrac{m(O_U)}{P_\theta(U)}=1$.

The proof follows easily from Theorem 3.5.
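Condition (ii) gives a simple computational test for w-car-compatibility. The following sketch applies it to the three complete-data distributions of Table 1 and the empirical distribution of the sample $\{w_1,w_2\}$, $\{w_2,w_3\}$, $\{w_1,w_2,w_3\}$; the function names `empirical` and `w_car_compatible` are hypothetical.

```python
from collections import Counter
from fractions import Fraction as F

fs = frozenset

def empirical(sample):
    """Empirical distribution m on the observed events O_U."""
    n = len(sample)
    return {U: F(c, n) for U, c in Counter(sample).items()}

def w_car_compatible(theta, m):
    """Lemma 4.6 (ii): for every w with theta[w] > 0,
    the sum over observed U containing w of m(O_U) / P_theta(U) equals 1."""
    def p(U):
        return sum(theta[w] for w in U)
    # For w with theta[w] > 0 every U containing w has p(U) > 0.
    return all(sum(m[U] / p(U) for U in m if w in U) == 1
               for w in theta if theta[w] > 0)

sample = [fs({"w1", "w2"}), fs({"w2", "w3"}), fs({"w1", "w2", "w3"})]
m = empirical(sample)

THETA = {
    1: {"w1": F(0), "w2": F(1), "w3": F(0)},
    2: {"w1": F(1, 2), "w2": F(0), "w3": F(1, 2)},
    3: {"w1": F(1, 3), "w2": F(1, 3), "w3": F(1, 3)},
}
compat = {i: w_car_compatible(th, m) for i, th in THETA.items()}
print(compat)  # {1: True, 2: True, 3: False}
```

For $\theta^{(3)}$ the sum for $w_1$ is $1/2+1/3=5/6$, so the uniform distribution is not w-car-compatible with this sample.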

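Proof of Theorem 4.4. (a)$\Rightarrow$(b): $m\sim_{\text{w-car}}P_{\hat\theta}$ trivially
implies $m\sim P_{\hat\theta}$. By Lemma 4.5, $L_{P,\text{sat}}$ is maximized by
$\hat\theta$. Also, $L_{P,\text{sat}}(\theta\mid\mathcal{U})\geq L_{P,\text{w-car}}(\theta\mid\mathcal{U})$
with equality for $\theta=\hat\theta$. Hence $\hat\theta$ maximizes $L_{P,\text{w-car}}$.

(b)$\Rightarrow$(c): Immediate from (9).

(c)$\Rightarrow$(a): Recall that $\theta\in\Delta^n$ is written as
$\theta=(p_1,\ldots,p_n)$ with $p_i=P_\theta(w_i)$.
Let $D:=\{\theta\in\Delta^n\mid L_{FV}(\theta\mid\mathcal{U})>0\}$. For $\theta\in D$ then

$$\frac{1}{N}\log L_{FV}(\theta\mid\mathcal{U}) = \sum_{U\subseteq W:\,m(O_U)>0} m(O_U)\log P_\theta(U) = \sum_{U\subseteq W:\,m(O_U)>0} m(O_U)\log\Bigl(\sum_{j:\,w_j\in U}p_j\Bigr).$$

This is differentiable on $D$, and with
$\mathcal{U}(w_i):=\{U\subseteq W\mid m(O_U)>0,\ w_i\in U\}$,

$$\frac{\partial}{\partial p_i}\,\frac{1}{N}\log L_{FV}(\theta\mid\mathcal{U}) = \sum_{U\in\mathcal{U}(w_i)} m(O_U)\Bigl(\sum_{j:\,w_j\in U}p_j\Bigr)^{-1} = \sum_{U\in\mathcal{U}(w_i)}\frac{m(O_U)}{P_\theta(U)}$$

[the sum on the right-hand side being interpreted as 0 when $\mathcal{U}(w_i)=\emptyset$].
For $\hat\theta$ as in (c) we have that
$S(V):=\{\theta\in\Delta^n\mid\mathrm{support}(P_\theta)=V\}\subseteq D$, and
the gradient of $(1/N)\log L_{FV}(\theta\mid\mathcal{U})$ is orthogonal to $S(V)$ at
$\hat\theta$. This can be expressed as the condition that for every
$\theta'=(p'_1,\ldots,p'_n)\in S(V)$

$$\sum_{i=1}^{n}\Bigl(\sum_{U\in\mathcal{U}(w_i)}\frac{m(O_U)}{P_{\hat\theta}(U)}\Bigr)(p'_i-\hat p_i)=0,$$

which is equivalent to

$$\sum_{i:\,w_i\in V}\Bigl(\sum_{U\in\mathcal{U}(w_i)}\frac{m(O_U)}{P_{\hat\theta}(U)}\Bigr)p'_i = \sum_{i:\,w_i\in V}\Bigl(\sum_{U\in\mathcal{U}(w_i)}\frac{m(O_U)}{P_{\hat\theta}(U)}\Bigr)\hat p_i.$$

This implies that $\sum_{U\in\mathcal{U}(w_i)} m(O_U)/P_{\hat\theta}(U)$ is a constant $k$
that does not depend on $w_i$ (for $w_i\in V$), and furthermore

$$k = k\sum_{i:\,w_i\in V}\hat p_i = \sum_{i:\,w_i\in V}\ \sum_{U\in\mathcal{U}(w_i)}\frac{m(O_U)}{P_{\hat\theta}(U)}\,\hat p_i = \sum_{U:\,m(O_U)>0}\frac{m(O_U)}{P_{\hat\theta}(U)}\sum_{i:\,w_i\in U\cap V}\hat p_i = \sum_{U:\,m(O_U)>0} m(O_U) = 1.$$

Now (a) follows from Lemma 4.6. □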



We remark that (a)-(c) in Theorem 4.4 also are equivalent to:

(d) $\hat\theta$ is a stationary point for the EM algorithm.

We do not give a formal proof here, but emphasize that in the current
context we assume the saturated complete-data model, and thus in (d) also
assume that the EM algorithm operates on the unrestricted parameter space
$\Delta^n$. Then the equivalence (a)$\Leftrightarrow$(d) easily follows from the
equivalence of w-car and the fair evidence condition.
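Statement (d) can be illustrated with a single EM iteration. The sketch below assumes the usual EM update for the face-value likelihood in the saturated model, in which each coarse observation $U_i$ is distributed over its elements according to $P_\theta(\cdot\mid U_i)$; this concrete form of the update is an assumption of the illustration, and `em_step` is a hypothetical name.

```python
from fractions import Fraction as F

fs = frozenset

def em_step(theta, sample):
    """One EM iteration: theta_new[w] = (1/N) * sum_i P_theta(w | U_i).
    Requires P_theta(U_i) > 0 for every observed U_i."""
    n = len(sample)
    new = {w: F(0) for w in theta}
    for U in sample:
        p_U = sum(theta[w] for w in U)
        for w in U:
            new[w] += theta[w] / p_U / n
    return new

sample = [fs({"w1", "w2"}), fs({"w2", "w3"}), fs({"w1", "w2", "w3"})]
THETA = {
    1: {"w1": F(0), "w2": F(1), "w3": F(0)},
    2: {"w1": F(1, 2), "w2": F(0), "w3": F(1, 2)},
    3: {"w1": F(1, 3), "w2": F(1, 3), "w3": F(1, 3)},
}
stationary = {i: em_step(th, sample) == th for i, th in THETA.items()}
print(stationary)                 # {1: True, 2: True, 3: False}
print(em_step(THETA[3], sample))  # moves to (5/18, 4/9, 5/18)
```

The parameters $\theta^{(1)}$ and $\theta^{(2)}$ are fixed points of the update, while the uniform $\theta^{(3)}$ is moved to $(5/18, 4/9, 5/18)$.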

For s-car one obtains the following analogue of Theorem 4.4.

Theorem 4.7. Let $\mathcal{U}$, $m$ be as in Theorem 4.4. For $\hat\theta\in\Delta^n$ the
following are equivalent:

(a) $m\sim_{\text{s-car}}P_{\hat\theta}$.

(b) $\hat\theta$ is a global maximum of $L_{P,\text{s-car}}(\theta\mid\mathcal{U})$ in $\Delta^n$.

(c) $\hat\theta$ is a global maximum of $L_{FV}(\theta\mid\mathcal{U})$ in $\Delta^n$.

The equivalence (b)$\Leftrightarrow$(c) here is immediate from the equality of
likelihood ratios of $L_{P,\text{s-car}}$ and $L_{FV}$. The nontrivial implication is
(c)$\Rightarrow$(a). It has (implicitly) been shown by Gill, van der Laan and
Robins [7] in the proof of their first theorem.



Example 4.8. Let $\mathcal{U}$ be as in Example 3.8. Then $m(O_{U_i})=1/3$ for
$i=1,2,3$. $P^{(1)}$ and $P^{(2)}$ are w-car distributions with marginal $m$ on
$\mathcal{O}$. By Theorem 4.4, $\theta^{(1)}$ and $\theta^{(2)}$ are global maxima of
$L_{P,\text{w-car}}$. $P^{(1)}$ also is s-car, and therefore $\theta^{(1)}$ is a global
maximum of $L_{P,\text{s-car}}$.
In the preceding example we found a single maximum of $L_{P,\text{s-car}}$, and
two distinct maxima of $L_{P,\text{w-car}}$. Gill, van der Laan and Robins [7] showed
that for every $m$ there exists $\theta$ with $m\sim_{\text{s-car}}P_\theta$, and $\theta$ is
essentially unique [for any $\theta'$ with $m\sim_{\text{s-car}}P_{\theta'}$ it must be
the case that $P_{\theta'}(U)=P_\theta(U)$ for all $U\in\mathcal{U}$]. Thus
$L_{P,\text{s-car}}$ has an essentially unique maximum. For w-car we
obtain the following result on the existence of maxima of $L_{P,\text{w-car}}$.

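Theorem 4.9. Let $\mathcal{U}$ and $m$ be as in Theorem 4.4. Let $V\subseteq W$ such
that for all $U\subseteq W$

$$m(O_U)>0 \ \Rightarrow\ V\cap U\neq\emptyset. \tag{10}$$

Then there exists $\hat\theta$ with $\mathrm{support}(P_{\hat\theta})\subseteq V$ and
$m\sim_{\text{w-car}}P_{\hat\theta}$.

Proof. From (10) it follows that $L_{FV}(\theta\mid\mathcal{U})>0$ for $\theta$ with
$\mathrm{support}(P_\theta)=V$. In particular, $L_{FV}(\theta\mid\mathcal{U})$ is not
identically zero on the compact set $\{\theta\mid\mathrm{support}(P_\theta)\subseteq V\}$
and attains a positive maximum at some $\hat\theta$. The theorem now follows
from Theorem 4.4. □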

Theorem 4.4 in conjunction with Lemma 4.5 provides yet another ignorability
result: maximization of $L_{FV}$ will yield a parameter $\hat\theta$ with
$m\sim P_{\hat\theta}$, and thus a global maximum of $L_{P,\text{sat}}$. Thus, the use of
the face-value likelihood instead of the observed-data likelihood also is
justified when we assume the saturated coarse data model
$\Sigma_{\text{sat}}(\Delta^n)$. In other words, ignorability holds
when the coarsening process is treated as completely unknown (and the
saturated model also is assumed for the underlying complete data). However,
it turns out that ignorability is not really the issue here, as maximum
likelihood solutions for the observed-data likelihood in the model
$\Sigma_{\text{sat}}(\Delta^n)$ can be found directly without optimizing $L_{FV}$.
Dempster [5] gives an explicit construction of the set
$\{\theta\in\Delta^n\mid m\sim P_\theta\}$, which briefly is as follows.

Consider any ordering $w_{i_1},w_{i_2},\ldots,w_{i_n}$ of the elements of $W$. Now
transform the coarse data $U_1,\ldots,U_N$ into a sample of complete-data items by
interpreting $U_j$ as an observation of the first $w_{i_h}$ in the given ordering
that is an element of $U_j$. Let $P_\theta$ be the empirical distribution of this
completed sample. By considering all possible orderings of $W$, one obtains in
this way distributions $P_{\theta_1},\ldots,P_{\theta_{n!}}$ on $W$. The set
$\{\theta\in\Delta^n\mid m\sim P_\theta\}$ now is the convex
hull of all these $P_{\theta_i}$. Moreover, the empirical distribution of any
completion of the data lies in the convex hull of the $P_{\theta_i}$.

It thus is very easy to directly determine some maximum likelihood solutions
of $L_{P,\text{sat}}$, simply as the empirical distribution of an arbitrary
completion of the data. An explicit representation of all solutions is obtained
by computing all $P_{\theta_i}$.
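The ordering-based construction is straightforward to carry out. The following is a minimal sketch of the completion step described above, applied to the sample $\{w_1,w_2\}$, $\{w_2,w_3\}$, $\{w_1,w_2,w_3\}$; the function name `extremal_completions` is hypothetical, and the sketch only enumerates the extremal distributions, it does not compute their convex hull.

```python
from collections import Counter
from fractions import Fraction as F
from itertools import permutations

fs = frozenset

def extremal_completions(W, sample):
    """For every ordering of W, complete each U by the first element of the
    ordering that lies in U, and return the distinct empirical distributions."""
    n = len(sample)
    result = set()
    for order in permutations(W):
        completed = [next(w for w in order if w in U) for U in sample]
        counts = Counter(completed)
        result.add(tuple(F(counts[w], n) for w in W))
    return result

W = ("w1", "w2", "w3")
sample = [fs({"w1", "w2"}), fs({"w2", "w3"}), fs(W)]
extremal = extremal_completions(W, sample)
for dist in sorted(extremal):
    print(tuple(str(p) for p in dist))
```

For this sample the six orderings produce five distinct distributions, because the two orderings that start with $w_2$ both complete every observation to $w_2$.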
The problem here is that the set $\{\theta\in\Delta^n\mid m\sim P_\theta\}$ typically
will be very large (much larger than the set
$\{\theta\in\Delta^n\mid m\sim_{\text{car}}P_\theta\}$), and therefore inferences based
on the coarse data model $\Sigma_{\text{sat}}(\Delta^n)$ will be too weak for practical
purposes. We thus see that making the car assumption really serves a second
purpose besides justifying the use of the face-value likelihood: we need
to make some assumptions on the coarsening mechanism, because otherwise
our model will be too weak to support practically useful inferences.

Figure 2 summarizes some of our results in terms of our running example.
Shown is the polytope $\Delta^3$ with the potential lines of the face-value
likelihood $L_{FV}(\cdot\mid\mathcal{U})$ for $\mathcal{U}$ as in Example 3.8. The two
distributions $P^{(1)},P^{(2)}$ are marked by circles. They correspond to nonzero
maxima of $L_{FV}$ relative to distributions with the same set of support. Marked
as diamonds are the five distinct distributions obtained from the extremal data
completions for the six possible orderings of $W$. Their convex hull is the set
of $\theta$ compatible with $m$.

From the results of this section we can also retrieve Gill, van der Laan 
and Robins' [7] result that "car is everything," that is, the car assumption 
cannot be rejected against the not-car alternative based on any observed
coarse data (assuming an underlying saturated complete-data model). This 
is because by Theorem 4.9 [and the corresponding result in [7] for s-car] there 
exists for any observed coarse data a car model with the observed marginal 
on O. Gill, van der Laan and Robins [7] show that the same need not hold 
for infinite sample spaces. Further results on the nontestability of the car 
assumption in general sample spaces have been obtained by Cator [1]. In 
the next section we will see that car also becomes testable for finite sample 
spaces with a parametric complete-data model. 

4.2. Nonsaturated models. Most of the preceding ignorability results are
no longer valid when the complete-data model is not $\Delta^n$. Only the weak ignorability
result of Theorem 4.2 can be retained for a wide class of complete-data
models.




[Figure 2: the polytope $\Delta^3$ with vertices (1,0,0), (0,1,0) and (0,0,1).]

Fig. 2. Summary of running example.

Definition 4.10. A complete-data model $\{P_\theta \mid \theta \in \Theta\}$ is support-continuous
if for all $\theta \in \Theta$ there exists a neighborhood $\tilde{\Theta} \subseteq \Theta$ such that
$\mathrm{support}(P_{\theta'}) \supseteq \mathrm{support}(P_\theta)$ for all $\theta' \in \tilde{\Theta}$.

Virtually all natural parametric models are support-continuous. The proof 
of Theorem 4.2 actually has established the following: 

Theorem 4.11. Let $\{P_\theta \mid \theta \in \Theta\}$ be a support-continuous complete-data
model. Every local maximum $\theta \in \Theta$ of $L_{FV}(\cdot \mid U)$ then is a local maximum of
$L_{P,w\text{-}car}(\cdot \mid U)$.

The following example shows that other results of Section 4.1 cannot be 
extended to parametric models. 

Example 4.12. Let $A$ and $B$ be two binary random variables. Let $W =
\{AB, A\bar{B}, \bar{A}B, \bar{A}\bar{B}\}$, where, for example, $A\bar{B}$ represents the state where
$A = 1$ and $B = 0$. We represent a probability distribution $P$ on $W$ as the
tuple $(P(AB), P(A\bar{B}), P(\bar{A}B), P(\bar{A}\bar{B}))$. Define

$\Theta = \{\theta = (a,b) \mid a \in [0,1], b \in [0,1]\}$,

$P_\theta = (ab, a(1-b), (1-a)(1-b), (1-a)b)$.

Now assume that the data $U$ consists of six observations of $A$ (i.e., the
set $\{AB, A\bar{B}\}$), three observations of $B$, three observations of $\bar{B}$ and one
observation of $\bar{A}\bar{B}$.

Figure 3(a) shows a plot of $L_{FV}(\theta \mid U)$. We can numerically determine the
unique maximum as $\hat{\theta} \approx (0.845, 0.636)$, which corresponds to $P_{\hat{\theta}} \approx (0.54, 0.31, 0.05, 0.1)$.
Restricted to the subset $\Theta_1 := \{\theta \in \Theta \mid 0 < a < 1, b = 1\}$ a local maximum is
attained at $\theta^* \approx (0.69, 1)$, which corresponds to $P_{\theta^*} \approx (0.69, 0, 0, 0.31)$.
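The two maxima can be reproduced with a plain grid search. The following is a minimal sketch, assuming the state order $(AB, A\bar{B}, \bar{A}B, \bar{A}\bar{B})$ for the tuple $P_\theta$ and the data read as six observations of $A$, three of $B$, three of $\bar{B}$ and one of $\bar{A}\bar{B}$; the function name, the event labels and the grid resolution are hypothetical choices, and a grid search is used only for transparency, not because it is the method behind the quoted values.

```python
import math

# Coarse observations and their counts, under the reading stated above
# (an assumption): 6 x A, 3 x B, 3 x not-B, 1 x (not-A, not-B).
COUNTS = {"A": 6, "B": 3, "notB": 3, "notA_notB": 1}


def log_face_value_likelihood(a, b):
    """Return log L_FV((a, b) | U), or -inf where the likelihood is zero."""
    # P_theta in the state order (AB, A notB, notA B, notA notB).
    p_ab, p_a_nb, p_na_b, p_na_nb = a * b, a * (1 - b), (1 - a) * (1 - b), (1 - a) * b
    event_prob = {
        "A": p_ab + p_a_nb,
        "B": p_ab + p_na_b,
        "notB": p_a_nb + p_na_nb,
        "notA_notB": p_na_nb,
    }
    if any(p <= 0.0 for p in event_prob.values()):
        return -math.inf
    return sum(n * math.log(event_prob[e]) for e, n in COUNTS.items())


STEPS = 400  # grid resolution 1/400 (hypothetical choice)
grid = [i / STEPS for i in range(1, STEPS)]

# Unrestricted maximum over the interior of the parameter square.
theta_hat = max(((a, b) for a in grid for b in grid),
                key=lambda t: log_face_value_likelihood(*t))
# Maximum restricted to the edge b = 1.
a_star = max(grid, key=lambda a: log_face_value_likelihood(a, 1.0))

print("theta_hat  ~", theta_hat)
print("theta_star ~", (a_star, 1.0))
```

Up to the grid resolution, the search should land close to the two parameter values quoted above.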

The set $\Theta_1$ corresponds to the set of support $V = \{AB, \bar{A}\bar{B}\}$ of $P_{\theta^*}$, that
is, $\theta \in \Theta_1 \Leftrightarrow \mathrm{support}(P_\theta) = V$. Similarly, the set $\Theta_2 := \{\theta \in \Theta \mid 0 < a < 1, 0 <
b < 1\}$ contains the parameters that define distributions with full set
of support $W$. $L_{P,w\text{-}car}(\theta \mid U)$, therefore, is given by multiplying $L_{FV}(\theta \mid U)$
by $c_{w\text{-}car}(V, U)$ when $\theta \in \Theta_1$, and by $c_{w\text{-}car}(W, U)$ when $\theta \in \Theta_2$. For all
$\theta \notin \Theta_1 \cup \Theta_2$ we obtain $L_{FV}(\theta \mid U) = 0$, so that further constants $c_{w\text{-}car}(V', U)$
do not matter. The approximate values for the relevant constants are $c_{w\text{-}car}(V, U) \approx
0.0003$ and $c_{w\text{-}car}(W, U) \approx 0.0001$.

A plot of $L_{P,w\text{-}car}(\theta \mid U)$ as given by (9) is shown in Figure 3(b). Note
the discontinuity at the boundary between $\Theta_1$ and $\Theta_2$ due to the different
factors $c_{w\text{-}car}(V, U)$ and $c_{w\text{-}car}(W, U)$. It turns out that the global maximum
now is $\theta^*$, rather than $\hat{\theta}$.
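A rough order-of-magnitude check makes the switch of the global maximum plausible. The constants below are the rounded values quoted in the text, and the face-value likelihood ratio of about 0.65 is an evaluation made here under the assumption that the data are six observations of $A$, three of $B$, three of $\bar{B}$ and one of $\bar{A}\bar{B}$, with state order $(AB, A\bar{B}, \bar{A}B, \bar{A}\bar{B})$; because the constants are given to one significant digit only, the result should not be read as a precise figure.

```latex
\[
\frac{L_{P,w\text{-}car}(\theta^* \mid U)}{L_{P,w\text{-}car}(\hat{\theta} \mid U)}
  = \frac{c_{w\text{-}car}(V,U)}{c_{w\text{-}car}(W,U)} \cdot
    \frac{L_{FV}(\theta^* \mid U)}{L_{FV}(\hat{\theta} \mid U)}
  \approx \frac{0.0003}{0.0001} \cdot 0.65 \approx 2 > 1 .
\]
```

The larger correction factor for the smaller support $V$ thus outweighs the lower face-value likelihood of $\theta^*$.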

Theorem 4.4 allows us to analyze the situation more clearly. It is easy
to see that $P = (9/13, 0, 0, 4/13)$ is a distribution that is w-car-compatible
with the empirical distribution $m$ induced by $U$. We find that $P = P_{\theta^*}$ for
$\theta^* = (9/13, 1) \approx (0.6923, 1)$, which thus turns out to be the precise value of
$\theta^*$ which initially was determined numerically. From Theorem 4.4 it now
follows that $P_{\theta^*}$ has maximal $L_{P,w\text{-}car}$-likelihood score even within the class
of all distributions on $W$, so that not only is $\theta^*$ a global maximum in $\Theta$, but
no better solution can be found by changing the parametric complete-data
model.
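The exact value $9/13$ can also be checked by hand. The short derivation below assumes the state order $(AB, A\bar{B}, \bar{A}B, \bar{A}\bar{B})$ and the data read as six observations of $A$, three of $B$, three of $\bar{B}$ and one of $\bar{A}\bar{B}$; on the edge $b = 1$ every observation then has probability either $a$ or $1-a$.

```latex
\begin{align*}
P_{(a,1)} &= (a,\, 0,\, 0,\, 1-a), \\
L_{FV}((a,1) \mid U) &= P(A)^6 \, P(B)^3 \, P(\bar{B})^3 \, P(\bar{A}\bar{B})
  = a^6 \cdot a^3 \cdot (1-a)^3 \cdot (1-a) = a^9 (1-a)^4, \\
\frac{d}{da} \log L_{FV}((a,1) \mid U) &= \frac{9}{a} - \frac{4}{1-a} = 0
  \;\Longrightarrow\; a = \frac{9}{13} \approx 0.6923 .
\end{align*}
```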

Under the s-car assumption the maximum likelihood estimate is $\hat{\theta}$. Thus,
the two versions of car here lead to quite different inferences. There also
is a fundamental difference with respect to testability: while $\theta^*$ is w-car-compatible
with $m$, $\hat{\theta}$ is not s-car-compatible with $m$. Consequently, the
s-car hypothesis, but not the w-car hypothesis, can be rejected against the
unrestricted alternative $S_{sat}$ by a likelihood ratio test (when $m$ is induced
by a sufficiently large sample).

We can summarize the results for nonsaturated models as follows: since
$S_{s\text{-}car}(\Theta)$ satisfies pd for any parametric model $\Theta$, ignorability for likelihood-based
inference is guaranteed by s-car.

For w-car, even the weak ignorability condition that maximization of $L_{FV}$
will give a maximum of $L_{P,w\text{-}car}$ does not hold. The apparent advantage of
s-car has to be interpreted with caution, however: whenever a maximum
$\hat{\theta}$ of $L_{FV}$ maximizes $L_{P,s\text{-}car}$, but not $L_{P,w\text{-}car}$, then $P_{\hat{\theta}}$ cannot be s-car-compatible
with $m$. Loosely speaking, this means that we obtain ignorability
for maximum likelihood inference through s-car but not through w-car only
when the data contradicts the s-car assumption. The same data, on the
other hand, might be consistent with w-car, but for inference under the w-car
assumption the face-value likelihood has to be corrected with the $c_{w\text{-}car}$
factors.

5. Conclusion. We can summarize the results of Sections 3 and 4 as
follows: ignorability for maximum likelihood inference and categorical data
holds under any of the following four modeling assumptions:

1. The w-car assumption for the coarsening mechanism and additional assumptions, such
that the resulting coarse data model satisfies pd.
2. The s-car assumption as the sole assumption on the coarsening process.
3. The saturated complete-data model and w-car as the sole assumption on the coarsening process.
4. The saturated model for both the complete data and the coarsening mechanism
(but here there are more efficient ways of finding likelihood maxima
than by maximizing the face-value likelihood).

In particular, one must be aware of the fact that the joint assumption car + pd
is ambiguous and can be inconsistent. This is because pd is not a well-defined
modeling assumption one is free to make, but a model property one has to ensure by other
assumptions.

Overall the ignorability results obtained from s-car are somewhat stronger 
than those obtained from w-car. Points in favor of w-car, on the other hand, 
are its equivalence with the fair evidence condition, and the fact that it is 
invariant for different versions of the conditional distribution of observed 
(coarse) data. Furthermore, the w-car assumption can be consistent with a 
given parametric model and observed data when s-car is not (but not vice 
versa).